Version: 25.08

Predefined Content Identifier Rules

Cyberhaven provides an extensive library of predefined Content Identifier Rules that form the foundation of the content inspection system. These rules are built using the Nucleuz Classification Engine and are designed to detect specific data types, patterns, and sensitive information with high accuracy and performance.

Overview

The Content matching rules page under Preferences provides a unified interface for you to define content inspection rules.

On the Rules tab, you can:

View all predefined and custom rules
Create new custom rules
Delete custom rules
Enable or disable predefined and custom rules

Note

Predefined rules cannot be deleted.
Rules that are currently applied in a policy or dataset cannot be deleted.

Predefined Rules Library

Cyberhaven includes a comprehensive library of predefined classification rules that detect sensitive data types commonly found in organizations worldwide. These rules are organized into logical groupings and cover various data protection requirements, regulatory compliance needs, and industry-specific data types.

Rule Categories

The predefined rules are organized into several high-level categories:

Personal Identifiers

Social Security Numbers: Various national formats with validation
National ID Numbers: Country-specific identification formats
Passport Numbers: International passport number patterns
Driver License Numbers: Regional license number formats
Tax Identification Numbers: Country-specific tax ID patterns

Financial Data

Credit Card Numbers: Multiple card types with Luhn algorithm validation
Bank Account Numbers: Various national and international formats
IBAN Numbers: International Bank Account Number validation
Routing Numbers: Banking routing and transit numbers
SWIFT Codes: Bank identifier codes

Healthcare Data

Medical Record Numbers: Healthcare identifier formats
Drug Enforcement Administration (DEA) Numbers: US registration numbers for controlled-substance prescribers
National Provider Identifiers: Healthcare provider IDs
Health Insurance Numbers: Medical insurance identifiers
Patient Identifiers: Various healthcare system IDs

Communication Data

Email Addresses: Various email format patterns
Phone Numbers: International and domestic phone formats
IP Addresses: IPv4 and IPv6 address patterns
URLs and Domains: Web address patterns
MAC Addresses: Network hardware identifiers

Government and Legal

Voter Registration Numbers: Electoral system identifiers
Court Case Numbers: Legal system case identifiers
License Numbers: Professional and business licenses
Permit Numbers: Government permit identifiers
Registration Numbers: Various government registrations

Authentication and Security

API Keys: Various API key patterns
Access Tokens: Authentication token formats
Passwords: Password pattern detection
Cryptographic Keys: Encryption key patterns
Certificates: Digital certificate identifiers

Rule Structure and Components

Each predefined rule contains multiple components that work together to accurately detect sensitive data:

Pattern Matching

Regular Expressions: Sophisticated regex patterns for format detection
Format Validation: Specific formatting requirements (e.g., XXX-XX-XXXX)
Length Constraints: Minimum and maximum character limits
Character Sets: Allowed characters and encoding requirements

Validation Functions

Checksum Algorithms: Mathematical validation (e.g., Luhn algorithm for credit cards)
Format Verification: Structural validation of data patterns
Range Validation: Numeric range checking where applicable
Cross-Reference Validation: Verification against known valid patterns

Context Analysis

Supporting Keywords: Contextual terms that increase confidence
Proximity Analysis: Related terms within specified distance
Document Structure: Location-based context (headers, forms, etc.)
Language Support: Multilingual keyword recognition
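
As an illustration of proximity analysis, the sketch below checks whether a supporting keyword appears within a fixed character window around a pattern match. The function name `has_keyword_nearby`, the 40-character default window, and the keyword list are assumptions made for this example; they do not describe how Cyberhaven implements context analysis.

```python
import re


def has_keyword_nearby(text, match_start, match_end, keywords, window=40):
    """Return True if any keyword occurs within `window` characters of the match.

    Hypothetical helper: the window size and matching strategy are assumptions.
    """
    lo = max(0, match_start - window)
    hi = min(len(text), match_end + window)
    # Look only at the text around the match, not at the match itself.
    context = (text[lo:match_start] + " " + text[match_end:hi]).lower()
    return any(
        re.search(r"\b" + re.escape(keyword.lower()) + r"\b", context)
        for keyword in keywords
    )


sample = "Employee SSN: 123-45-6789 on file"
found = re.search(r"\d{3}-\d{2}-\d{4}", sample)
print(has_keyword_nearby(sample, found.start(), found.end(), ["ssn", "social security"]))
```

A wider window catches more supporting terms but also increases the chance that an unrelated keyword boosts a match.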

Confidence Scoring

Base Confidence: Initial confidence based on pattern match
Context Boost: Additional confidence from supporting evidence
Validation Confirmation: Confidence increase from successful validation
Threshold Management: Configurable confidence thresholds
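
The scoring components above can be pictured as an additive model. The sketch below is only an illustration: the `Evidence` class, the integer weights (50, 20, 30), and the default threshold of 70 are hypothetical values chosen for the example, not Cyberhaven's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Signals collected for one candidate match (hypothetical structure)."""
    pattern_matched: bool
    keyword_nearby: bool = False
    validation_passed: bool = False


# Hypothetical weights on a 0-100 scale.
BASE_CONFIDENCE = 50
CONTEXT_BOOST = 20
VALIDATION_BOOST = 30


def confidence(evidence):
    """Combine base confidence, context boost, and validation confirmation."""
    if not evidence.pattern_matched:
        return 0
    score = BASE_CONFIDENCE
    if evidence.keyword_nearby:
        score += CONTEXT_BOOST
    if evidence.validation_passed:
        score += VALIDATION_BOOST
    return min(score, 100)


def is_match(evidence, threshold=70):
    """Report a detection only when the score reaches the configured threshold."""
    return confidence(evidence) >= threshold


print(confidence(Evidence(True, keyword_nearby=True, validation_passed=True)))
```

With these assumed weights, a bare pattern match stays below the threshold, while a match backed by either a nearby keyword or a successful validation is reported.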

Rule Performance Characteristics

Detection Accuracy

High Precision: Minimized false positives through validation
Comprehensive Coverage: Multiple patterns for format variations
Contextual Awareness: Reduced false positives through context analysis
Adaptive Thresholds: Configurable sensitivity levels

Processing Efficiency

Optimized Patterns: Regular expressions tuned for performance
Parallel Processing: Rules designed for concurrent execution
Memory Efficiency: Optimized memory usage patterns
Scalable Architecture: Performance maintained at scale

Regional Adaptations

Localized Patterns: Country-specific data formats
Language Support: Multilingual keyword recognition
Cultural Context: Region-appropriate detection patterns
Regulatory Alignment: Compliance with local data protection laws

Example Rule Types

Social Security Number (US)

Pattern: XXX-XX-XXXX format detection
Validation: Area number and group number validation
Context: Keywords like "SSN", "Social Security", "Tax ID"
Confidence: High confidence with validation, medium without
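
A minimal sketch of the XXX-XX-XXXX pattern combined with area and group validation follows. It assumes the publicly documented US rules that an area number is never 000, 666, or 900-999 and a group number is never 00; the serial-number check is an extra assumption not mentioned above. The names `SSN_PATTERN`, `is_plausible_ssn`, and `find_ssns` are hypothetical, and the predefined rule may apply different checks.

```python
import re

# Three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def is_plausible_ssn(area, group, serial):
    """Reject area, group, and serial values that are never issued."""
    if area in ("000", "666") or int(area) >= 900:
        return False
    if group == "00":
        return False
    if serial == "0000":  # extra check, assumed for this sketch
        return False
    return True


def find_ssns(text):
    """Return pattern matches that also pass the plausibility checks."""
    return [
        match.group(0)
        for match in SSN_PATTERN.finditer(text)
        if is_plausible_ssn(*match.groups())
    ]


print(find_ssns("SSN 123-45-6789, not 000-12-3456 or 123-00-4567"))
```

Passing these checks shows only that a value is structurally possible, not that the number was ever issued.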

Credit Card Numbers

Pattern: 13-19 digit sequences with optional separators
Validation: Luhn algorithm checksum verification
Context: Keywords like "card", "credit", "payment"
Types: Visa, MasterCard, American Express, Discover, etc.
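
The pattern and checksum steps above can be sketched as follows. The Luhn algorithm itself is standard; the regular expression, the separator handling, and the names `CARD_PATTERN`, `luhn_valid`, and `find_card_numbers` are assumptions made for this example rather than the predefined rule's actual definition.

```python
import re

# 13 to 19 digits, each optionally preceded by a single space or hyphen.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text):
    """Return candidate sequences that match the pattern and pass Luhn."""
    return [
        match.group(0)
        for match in CARD_PATTERN.finditer(text)
        if luhn_valid(match.group(0))
    ]


print(find_card_numbers("Pay with card 4111 1111 1111 1111 today"))
```

The checksum filters out most random digit runs, which is why a validated match can carry more confidence than a pattern-only match.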

Email Addresses

Pattern: local-part@domain format with RFC compliance
Validation: Domain structure and character validation
Context: Communication-related keywords
Variations: Multiple format variations and international domains
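
As a rough illustration of pattern matching followed by domain-structure and character validation, the sketch below uses a deliberately simplified expression; it does not cover the full address grammar allowed by the RFCs. The length limits (64 characters for the local part, 255 for the domain) follow RFC 5321, and the names `EMAIL_PATTERN`, `is_plausible_email`, and `find_emails` are hypothetical.

```python
import re

# Simplified candidate pattern: local part, "@", then two or more domain labels.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def is_plausible_email(candidate):
    """Apply basic structural checks to a candidate address."""
    local, _, domain = candidate.rpartition("@")
    if not local or len(local) > 64 or len(domain) > 255:
        return False
    # Dots may not lead, trail, or repeat in the local part.
    if local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    labels = domain.split(".")
    # Domain labels may not start or end with a hyphen.
    if any(label.startswith("-") or label.endswith("-") for label in labels):
        return False
    return len(labels[-1]) >= 2


def find_emails(text):
    """Return candidates that match the pattern and pass the checks."""
    return [
        match.group(0)
        for match in EMAIL_PATTERN.finditer(text)
        if is_plausible_email(match.group(0))
    ]


print(find_emails("Contact alice.smith@example.com for access"))
```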

IBAN Numbers

Pattern: Country code + check digits + account identifier
Validation: MOD-97 checksum algorithm
Context: Banking and financial keywords
Coverage: All IBAN-participating countries
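
The MOD-97 check can be sketched as below: move the first four characters to the end, convert letters to numbers (A=10 through Z=35), and confirm that the resulting integer leaves a remainder of 1 when divided by 97. The function name `iban_valid` is hypothetical, and this sketch checks only the generic structure; it does not enforce the country-specific lengths that a production rule would likely verify.

```python
import re

# Country code, two check digits, then 11 to 30 alphanumeric characters.
IBAN_STRUCTURE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")


def iban_valid(iban):
    """Return True if `iban` has a generic IBAN shape and passes MOD-97."""
    compact = iban.replace(" ", "").upper()
    if not IBAN_STRUCTURE.fullmatch(compact):
        return False
    # Move country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps "0"-"9" to 0-9 and "A"-"Z" to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

Python integers have arbitrary precision, so the full number can be reduced in one step; languages with fixed-width integers usually compute the remainder piecewise.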

Rule Management

Enabling Rules

Enable the predefined and custom rules you want to use for content inspection. Cyberhaven's content inspection engines will analyze content using the enabled rules to identify sensitive data patterns.

Rule Limitations

Note

You cannot disable rules currently in use within a policy or dataset.
The total number of rules that can be enabled at the same time is limited. The limit depends on system resources and performance requirements.

Selection Guidelines

When selecting predefined rules:

Data Relevance: Choose rules that match the types of sensitive data in your environment
Regional Requirements: Select rules appropriate for your geographic regions
Regulatory Compliance: Include rules required for applicable compliance frameworks
Performance Impact: Consider the cumulative processing overhead of enabled rules
Accuracy Requirements: Balance comprehensive coverage with acceptable false positive rates

Policy Association

Predefined rules are used within Content Identifier Policies to:

Define Detection Scope: Specify which data types to detect
Set Confidence Thresholds: Configure sensitivity levels
Combine Multiple Rules: Create comprehensive detection policies
Enable Contextual Detection: Leverage supporting evidence

Performance Considerations

Resource Usage

CPU Impact: Processing overhead varies by rule complexity
Memory Requirements: Rules consume system memory during execution
I/O Considerations: Content scanning affects storage and network performance
Scalability: Performance impact scales with content volume and rule count

Optimization Strategies

Selective Enablement: Enable only necessary rules for your environment
Threshold Tuning: Adjust confidence thresholds to balance accuracy and performance
Rule Prioritization: Focus on high-value data types first
Performance Monitoring: Track system performance with different rule configurations
